Predictive modeling of evoked intracranial EEG response to medial temporal lobe stimulation in patients with epilepsy

Despite promising advancements, closed-loop neurostimulation for drug-resistant epilepsy (DRE) still relies on manual tuning and produces variable outcomes, while automated predictable algorithms remain an aspiration. As a fundamental step towards addressing this gap, here we study predictive dynamical models of human intracranial EEG (iEEG) response under parametrically rich neurostimulation. Using data from n = 13 DRE patients, we find that stimulation-triggered switched-linear models with ~300 ms of causal historical dependence best explain evoked iEEG dynamics. These models are highly consistent across different stimulation amplitudes and frequencies, allowing for learning a generalizable model from abundant STIM OFF and limited STIM ON data. Further, evoked iEEG in nearly all subjects exhibited a distance-dependent pattern, whereby stimulation directly impacts the actuation site and nearby regions (≲ 20 mm), affects medium-distance regions (20 ~ 100 mm) through network interactions, and hardly reaches more distal areas (≳ 100 mm). Peak network interaction occurs at 60 ~ 80 mm from the stimulation site. Due to their predictive accuracy and mechanistic interpretability, these models hold significant potential for model-based seizure forecasting and closed-loop neurostimulation design.

seizures, as well as long-term benefits such as the 9 modification of cortical excitability via neuroplasticity [1,2].Nevertheless, its state-of-the-art clinical implementation, the RNS ® system from NeuroPace, Inc. [1] has known limitations, including a lack of mechanistic understanding [3], simplistic detection algorithms, and non-adaptive pre-selected stimulation parameters [1,4,5].As a result, though highly effective in some individuals, RNS outcomes vary widely among patients, and only 18% of those receiving RNS implantation experience seizure freedom for a period of 1 year or longer [6].To establish neurostimulation as a robust treatment with predictable outcomes, it is vital to have the ability to predict the future trajectory of neural activity, both in the absence and presence of neurostimulation.Such predictive modeling, in particular, opens the door to a host of model-based detection and control algorithms that have been developed and studied for decades in engineering [7][8][9] but often need accurate predictive models to succeed.
Despite decades of research, accurate predictive models of an epileptic brain's response to neurostimulation are largely lacking.A large group of studies have developed models that primarily provide explanations for the disorder on a phenomenological level [10][11][12][13].These models have been used for extracting features in seizure detection algorithms [14][15][16][17][18][19][20][21], conducting stability analysis [8,22], and localizing seizures [23].However, these models are not tuned using subject-specific stimulation data and are thus inadequate for robust and predictable closed-loop control design over the diverse set of parameters for neurostimulation [8,9].A second body of work has pursued more accurate and/or subjectspecific models using detailed finite-element modeling of the brain's electromagnetic response to neurostimulation [24][25][26][27].This is often done in conjunction with biophysical modeling and used for the analysis and optimization of the impact of electrode placement and (de)polarization activity of neuronal tissue.
Nevertheless, these models are often not built for and capable of accurate time-series prediction, particularly at a large-scale network level beyond the local stimulation site.
Finally, a third body of work has specifically pursued data-driven predictive modeling of the brain's response to neurostimulation.Multiple studies have used linear autoregressive models to fit the neural activity triggered by neurostimulation [28][29][30][31] while others have opted for nonlinear kernel-based autoregressive modeling [32,33].However, these works have mainly focused on univariate modeling (i.e., modeling the neurostimulation-evoked response at a single measurement source), while modeling the evoked network response has remained more challenging [34][35][36].
Furthermore, a significant portion of these studies relies on simulated data, data from non-human primates, or data in the absence of stimulation.As such, accurate subject-specific predictive models of brain network dynamics under neurostimulation have remained largely lacking.
In this study, we present a rigorous data-driven framework for developing subject-specific network dynamical models of intracranial electroencephalog-raphy (iEEG) dynamics in the absence and presence where multiple stimulation parameters are varied in 7 extensive combinations.We follow a standard system 8 identification approach [37] where models are com-9 pared based on the accuracy of their one-step-ahead 10 predictions (namely, predictions of iEEG measure-11 ments at any sample t based on all the available histo-12 ries of past iEEG and stimulation data up to sample 13 t − 1).This emphasis on predictive modeling is not 14 only essential for closed-loop seizure prediction and 15 suppression, as noted earlier, but also provides the 16 means for studying Granger causality among the in-17 put (neurostimulation) and output (iEEG) variables, 18 as described next.

20
Model.The ability of any computational model to 21 explain data depends heavily on its chosen structure.22 In this work, unless when otherwise stated, we use a 23 bilinear model of the form This bilinear 35 choice of the model structure is motivated by the long 36 history of bilinear modeling in neuroscience [38] as 37 well as our prior work showing the linearity of resting 38 state (i.e., STIM OFF) iEEG [39].We thus depart 39 from linearity towards highly complex models (e.g., 40 deep neural networks) gradually and only to the ex-41 tent supported by data.Bilinear models provide a 42 natural first step in moving beyond linearity [40][41][42] 43 and are thus preferred unless more complex models 44 provide substantial improvement in accuracy, a topic 45 to which we will return later (cf. Figure 6b).iEEG data contains about 300ms of causal his-1 torical dependence, despite significant hetero-2 geneity across subjects and channels.Unlike 3 models with latent states such as linear state-space 4 models [37,43] or hidden Markov models [44], au- Increasing the amount of autoregressive history (L) almost monotonically improves predictive accuracy, but only marginally so beyond 200-300ms.Specifically, models with larger L have larger win rates across subjects and channels (Figure 2b, bar plot), but increasing autoregressive history from 300 to 500ms only improves win rate by 10% (Figure 2b, dashed horizontal line).This trend can also be seen from average prediction error (average normalized MSE across all subjects and channels), which drops significantly when increasing L from 1ms to 50ms and decays fairly notably until about 200-300ms but saturates thereafter (Figure 2b, solid line).
Unlike autoregressive history, the amount of past input history (M ) that has a causal effect on evoked iEEG response varies greatly among subjects and channels (Figure 2a).For about two-thirds of the subject-channels, some (though still heterogeneous) amount of input history improves predictive accuracy, while for other subject-channels inclusion of any amount of input history, even from the immediate past, lowers predictive accuracy.This lack of direct causal effect in the latter group is consistent with the hypothesis that not all channels are directly excited by stimulation, but rather receive the impact of stimulation through complex spatiotemporal interactions with other channels.Such network-mediated stimulation effects may also be small enough to be statistically undetectable compared to baseline fluctuations [39], characterizing a group of subject-channels with no evoked response, whether direct or indirect (see also Figure 5).
Given the relatively fast sampling rate of iEEG (1000Hz), it is also possible that successive input/output samples contain redundant, correlated information and need not all be included in the model.To test this hypothesis, we compared the abovedescribed models with densely-sampled input-output history against sparsely-sampled alternatives where both the input and output are sampled every τ milliseconds (τ > 1).The same number of lags was used across the dense and sparsely sampled models to ensure fair comparisons.We find that at low model lags, Figure 2: Model order distribution and performance across subjects and channels.For each subject-channel, we trained ARX models of various orders (L, M ) (L ∈ {0, 1, 50, 100, ..., 500} and M ∈ {0, 100, 200, 300}) and determined the pair (L, M ) that resulted in the model with the best fit over unseen data.(a) We plotted the Win rate i.e., the percentage of times each (L, M ) pair is the best across all considered channels (n = 1547).We used 2 definitions of the best model: based on the mean ((L, M ) pair that resulted in the lowest MSE) and based on the median ((L, M ) pair that is superior to other model orders as per Wilcoxon signed-rank test (see Methods)).Irrespective of the definition used, we saw that larger autoregressive lags (i.e., L ≥ 400) are favored across more than 90% channels.About 66% of the modeled channels favored a non-zero input lag (i.e., M > 0) and hence are causally affected by direct stimulation input.(b) We also plotted the validation MSE (averaged across all M ) as a function of the autoregressive lags (L) (left y-axis) and found that although a larger L is more favored, doing so only resulted in a marginally better fit in comparison to a model with 200 to 300 autoregressive lags.In the same figure, we also plot (right y-axis) the 'statistical win rate', defined as the percentage of channels where a model of order L has an MSE that is statistically similar (i.e., within ±2S.E.M ) to the MSE corresponding to the best model.From this plot, we observed that a model with L = 300 is sufficient for nearly 90% of all the channels.sparse sampling is more beneficial indicating that the 1 inherent process is more dependent on dynamics over 2 longer time horizons as opposed to highly correlated 3 dense samples.With increasing lags however (Fig- 4 ure 3), dense models achieve higher predictive accu-5 racy, even though sparse models with the same num-6 ber of lags have access to a wider window for pre-  ages the use of delay embedding [46] despite its theoretical appeal [47], and instead encourages invest-11 ing the available computational power (which is often  For each channel in the data, we considered 2 categories of models: ARX models with dense temporal sampling (i.e., iEEG lags are spaced 1ms apart) and ARX models with sparse temporal sampling (i.e., iEEG lags are spaced τ > 1ms apart).For the latter, the exact sampling time τ was chosen separately for each channel based on delay embedding theory using the method described in [48].We trained sparse and dense models of various lags (L ∈ 1, 50, 100, 150, 200, 250, 300) and plotted the normalized MSE across subject-channels as a function of L. Although the sparse ARX model with L lags has access to a much larger duration of iEEG history (i.e., τ Lms) in comparison to a dense model (i.e., Lms), the former has lower accuracy.Thus, temporally short-range models with higher sampling rates are more predictive.
iEEG evoked response is best described by switched-linear dynamics.The functional form, extent, and even the very presence of nonlinearity in the neural dynamics of patients with epilepsy has been the topic of much debate [49,50].In our recent work [39], we showed that resting state (interictal, STIM OFF) iEEG dynamics are best described by linear autoregressive models.However, our examination of STIM ON and STIM OFF periods reveals notable distinctions in the optimal autoregressive parameters (i.e., matrix A k in Eq. ( 1)) when fit separately for each duration.Therefore, we first sought direct evidence for whether the presence of stimulation makes iEEG dynamics nonlinear.We tested if matrix A k trained using resting state data (i.e., STIM OFF) can generalize over STIM ON durations.We did this by comparing 2 linear models across each subject: (1) an ARX model trained over STIM ON duration (Stim Model in Figure 4a) and (2) the Stim Model with the A k matrix replaced with that trained using STIM OFF data (RS Model in Figure 4a).If the dynamics in the STIM ON and STIM OFF durations were indeed similar, we could expect the above 2 models to perform similarly when we test them on unseen data.Moreover, an agreement between these models would indicate that the iEEG dynamics can be captured largely by a single linear model.However, our analysis in Figure 4a and 4b showed that the Stim Model performs significantly better on un- even some (about 10%) of subject-channels are best described by a single linear model throughout all STIM ON and STIM OFF durations.The distributions of win rates for each subject also show a consistent preference for switched-linear models across all subjects, even though vast heterogeneity also persists in these distributions, both within and between subjects (Figure 5b).
A further aspect of the large heterogeneity in the channel models in Figure 5 is whether stimulation has a direct causal effect on iEEG evoked response.We observed that stimulation effects are predominant in more than 75% cases across subject-channels.This observation was further validated when we conducted a bootstrapping experiment (see Methods) over the input lags and found empirical evidence for direct stimulation effects in about 95% of the 75% subjectchannels considered in the test.
Despite the higher predictive accuracy of switchedlinear models compared to linear ones, their binary gating enforces a single linear model for all STIM ON durations, irrespective of the stimulation's frequency and amplitude.A fully bilinear model with U(t) = [U M (t)] ⊺ in the form described by equation Eq. ( 1) can potentially alleviate this issue by allowing the stimulation waveform to directly modulate the linear dynamics using the interaction term A third, middle-ground alternative is to employ an amplitude-weighted switched-linear (AWSL) model where the gating term U(t) is determined by the amplitude of the stimulation rather than a binary signal.However, we found the switched-linear model to be more consistent with the data compared to the more complex bilinear and AWSL alternatives (Figure 6).This is remarkable not only from a biological perspective but also from an engineering perspective given the large body of literature on control design for switched linear systems [51].
While we notice the lack of benefit in bilinear over switched-linear models, this does not remove the possibility that significantly more complex models still explain stimulation-evoked iEEG response better.
When compared against some of the most commonly known nonlinear models, switched-linear models were   1), where we refer to models 9 with at least one nonzero D i k (at least one nonzero 10 effective connectivity between two electrodes) as vec-11 tor autoregressive (VAR).In our recent work [39], we 12 found VAR models to be less, not more, predictive of 13 resting state iEEG compared to models with no net-14 work interactions (all D i k = 0), indicating a lack of 15 causal effects among the channels at rest.

16
The presence of stimulation, on the other hand, 17 has the potential to boost causal network interactions 18 through which the stimulation effect can spread.As 19 noted earlier (cf.Figure 2a), across all subjects, we 20 found no direct causal effect from stimulation input 21 to about 25% of the channels.Nevertheless, several 22 of such channels still demonstrate clearly heightened 23 activity during STIM ON periods, therefore hinting 24 at an indirect causal effect of stimulation mediated 25 by network interactions (Figure 7).
Assuming that the spread of the stimulation 27 through the brain connectome is the main reason for 28 the significance of network effects in STIM ON du-29 rations compared to resting state periods, then the 30 stimulation channel itself should have no causal net-31 work effect from other channels.Therefore, we hy-32 pothesize that response at the site of stimulation can 33 be best explained by a switched-linear model without 34 network interactions (i.e., switched ARX).This was 35 indeed the case when we compared switched VARX   Analyzing the channels corresponding to electrodes 12 that were directly stimulated, we consistently found 13 that these non-stimulated channels most often receive Channel importance as a function of distance indicates an S-shaped relationship.
Evoked iEEG dynamics are highly consistent across stimulation frequencies but less consistent across sessions.A critical challenge for model-based control design for neurostimulation is learning models that generalize beyond the (past) data over which they are trained, particularly given the extreme inter and intra-subject variability in epilepsy on the one hand and scarcity of stimulation data on the other.One aspect of such generalizability is across responses to different stimulation frequencies, which is not trivial given that various stimulation frequencies elicit unique inhibitory and excitatory responses [36].We performed a large-scale combinatorial experiment wherein we trained models for each channel using a dataset with k of the 6 stimulation frequencies (including 0Hz, i.e., STIM OFF durations) and validated the models on a test dataset containing all the 6 frequencies.Figure 9 shows the normalized MSE for different values of k, averaged across all channels and 6 k possible models for each value of k.We found that using a training dataset with a higher variety of stimulation frequencies proves advantageous in learning a more generalizable model.In particular, the inclusion of 0Hz stimulation (STIM OFF data) in training data resulted in a better model.Given the abundance of such STIM OFF recordings across various datasets, our result indicates that such data can be leveraged to learn more accurate models of stimulation-evoked response models.Additionally, the saturation of the MSE with just 6 frequencies indicates that it is possible to learn a predictive model with a relatively safe and simple experimental design.
Another important aspect of the generalizability is that across temporally-spaced sessions.If brain dynamics were to be stationarity across different sessions, combining the data directly from these sessions should result in a superior model compared to fitting a model using data from only one session.We thus compared 3 types of switched-linear models over 2 different recording sessions: a model trained on Session 1, a model trained on Session 2, and a third model trained on both sessions.When these models were tested using unseen data from one of the sessions (Session 1), we found the model trained on We found that including more frequencies results in a more generalizable model.(b) However, we did not find it to be beneficial when we extend the dataset using recordings from multiple sessions.Specifically, when we considered 2 sessions and the 3 possible models that can be trained from them (using only session 1 or only session 2 or both in the training set), we found that unseen data from session 1 was best predicted by the model trained on data from Session 1 only.(* used when comparison p − value < 0.05) data only from that session to be the most predic-1 tive.This was in fact the case in all 13 subjects.This    have been shown to be able to differentiate between 30 SOZ and non-SOZ regions [7,53] as well as pre-ictal, onset, and ictal durations [8].
A rather undesirable aspect of iEEG recordings in general is the presence of stimulation artifacts [36] and epileptic discharges.From the perspective of understanding brain function, it is common practice to remove any such effect, often conservatively, and only consider the "clean" data for modeling and analysis [54,55].However, when modeling the brain for the purpose of closed-loop detection and control, we need models that accurately capture all the effects and elements that appear in the control loop, including all 3 durations (pre-STIM, STIM ON, post-STIM) and regions (SOZ and non-SOZ, gray and white matter).We can no longer afford to disregard numerous channels and recording durations, as often happens during common iEEG preprocessing.Therefore, we adopt a minimal pre-processing pipeline (cf.Methods) and delegate the task of handling complexities in the data to the data-driven model.
An important aspect of our findings is the vast heterogeneity among optimal model structures (let alone parameters) among subjects and even channels within the same subject.This is understandable considering the variations in input effects and the diverse network influences across channels, and only reinforces the need for individualized analysis and treatment in epilepsy [56].We even observe sufficient heterogeneity among the dynamics of different recording sessions from the same individual to prevent a model tuned to one session from properly generalizing to another.This is conceptually similar to the well-known lack of stability of intracranial multiunit activity recordings across sessions [57] but is observed at a significantly larger spatial scale where the physical drift of the recording electrodes relative to the surrounding tissue is less to blame.Nevertheless, across scales, such non-stationarity necessitates adaptive detectors and controllers with constant re-tuning of models and algorithms [58,59].
Our modeling framework is similar to Dynamic Causal Modeling (DCM) [60] in a number of respects.Bilinear families of dynamic models have a long history in DCM, and switched-linear models with a STIM ON/OFF switching signal are conceptually similar to works that build separate DCM models for STIM ON and STIM OFF periods [41,61].Nevertheless, we depart from DCM in a number of ways, including a prediction-error-based parameter estimation methodology [37] instead of the variational Bayes methodology used in DCM [62].Furthermore, DCM typically uses single-lag bilinear models for fMRI [41] and complex neural mass models for EEG [63], whereas we show the power of bilinear autoregressive models with long input-output histories in explaining EEG dynamics.These modifications not only make the resulting models more predictively accurate but also significantly more scalable to large-scale recordings.Similar hybrid modeling approaches have also been done for autoregressive models in works such as [40] and provide a broad rationale for the models described in this paper.
In the context of system identification [37], the experiment design process holds significant importance for and essentially limits the achievable accuracy of any downstream modeling.A good experiment design process involves probing the system with a persistently exciting or a diverse set of input frequencies in order to record all the possible dynamical modes of the system [37].This may even be more important for the brain, as it has been suggested that different stimulation frequencies can result in distinct types of responses, with lower frequencies (10-60 Hz) typically inducing inhibitory effects and higher frequencies (100-200 Hz) producing excitatory responses [36].Thus, it becomes crucial to incorporate a diverse range of stimulation parameters before modeling.Nevertheless, the conventional approach to experimental design in human studies [38,40] often entails repeatedly applying a single, fixed-parameter stimulation.This will allow the models to simply "memorize" the brain's response to that specific stimulation, without necessarily learning anything generalizable about the brain's response to neurostimulation.The RAM Parameter Search recordings used in this study are one of very few that encompass a wide range of stimulation frequencies, amplitudes, and durations.However, this dataset is not ideal either, e.g., due to a lack of recordings where changes in stimulation parameters occur without going through a STIM OFF phase.While recent works such as that in [34] have attempted to construct more parameter-rich experiments for non-human primates, similar efforts toward modeling the human brain are largely absent.
Nevertheless, the ability to learn predictive models using the RAM dataset, as described in our work, suggests that it is feasible to achieve successful modeling using a relatively small yet carefully chosen set of stimulation parameters.tive Memory (RAM) project, consisting of multichan-14 nel resting state, task-induced, and neurostimulation-15 induced iEEG responses of over 80 human sub-16 jects [64].As reported extensively in the RAM docu-17 mentation, the highest stimulation amplitude (A saf e ) 18 that could be applied to each subject without result-19 ing in after discharges were first measured.This value 20 was then used in the PS2 experiments which we use 21 in this work, providing subjects with 500ms fixed-22 duration stimulations of biphasic pulse trains deliv-23 ered at varying combinations of current amplitudes (3 24 levels: A saf e , A saf e −0.25mA and A saf e −0.5mA) and 25 frequencies (single pulse, 10Hz, 25Hz, 50Hz, 100Hz, 26 or 200Hz).Each subject has undergone multiple 27 stimulation sessions, ranging from 2 to 5, with each 28 session involving a different stimulus (anode-cathode) 29 location as shown in Figure 1.

30
During each session, pulse trains of each combina-31 tion of amplitudes and frequencies were repeatedly 32 delivered multiple times (about 20-30 times, equiva-33 lent to 30 minutes of total experimentation).Such 34 exhaustive repeated parameter sweep data is rare in 35 stimulation-evoked iEEG and makes this dataset par-36 ticularly rich for the modeling purposes of this study.

37
Preprocessing and data decomposition.As 38 noted earlier (see Discussions), we used minimal pre-39 processing in order to build models that can capture 40 stimulation-evoked iEEG dynamics in its full com-41 plexity.We removed power line harmonics by ap-42 plying 4th-order Butterworth notch filters at 60Hz, 43 120Hz, and 180Hz.We then detrended the signal us-44 ing a first-order polynomial fit to remove linear drifts 45 and z-scored the data to standardize signals across all 46 subjects and channels.Finally, we selected windows 47 of length 1500ms around each stimulation.These 48 consisted of 500ms pre-stim, 500ms stim, and 500ms 49 post-stim for pulse train stimulations and 750ms pre-50 stim and 750ms post-stim for single-pulse stimula- in this work are listed in Table 1.
Parameter sets of linear and bilinear/switched linear (BL/SL) models used in our work.Note that bilinear and switched linear models have the same set of nonzero weights and only differ in their encoding of U (see 'Preprocessing and data decomposition' above).In each case, the shown matrices are learned from data while matrices outside of this set are assigned to 0.

35
Determining the structure of Eq. ( 1) involves choosing appropriate model orders (L, M , and P ) as well as determining the subset of {A k , B k , C k , D k } that can be nonzero.We make all such choices in a data-driven manner.This is particularly important when we want to fit personalized models, where we need to account for the knowingly-large subjectto-subject variability [56,65].Given the compound and combinatorial nature of structural choices, we also use a hierarchical bottom-up approach where we tune low-level quantities (e.g., number of autoregressive lags) before selecting higher-level choices (e.g., type of nonlinearity or presence of network interactions).For each channel, we first assume an ARX model and find the pair of iEEG and input lags (i.e., L and M ) that maximizes the model accuracy.Fixing the obtained values of (L, M ), we train the models described in Table 1 while keeping the value of the maximum network lags P as 1.Once we statistically determine the best model structure (i.e., linear or bilinear/switched), we use it to determine the optimal value of P by evaluating the validation MSE of models fit with different values of P (increased from 1 to 100 in steps of 10).
In addition to the above models, we trained other nonlinear models, as follows: Artificial Neural Network (ANN): A 2-layer feedforward network with ReLU activation, with 500 and 50 nodes in the first and second layer respectively, and inputs being Y L k and U M .We opted for the ReLU activation function due to our observation that it yielded superior performance compared to alternative nonlinear functions, such as tanh.
Long Short-Term Network (LSTM): Standard single layer LSTM with y k (t − 1) and u(t − 1) as input and expected output ∆y k (t) Sparse Identification of Nonlinear Dynamics (SINDy) using 3 rd degree polynomial expansion Nonlinear ARX (NLARX) specified in [66], model order (50, 50, 1) used for all channels Learning.We train the free parameters of each model by minimizing the regression mean squared error The tuning of the delay time τ k has been studied ex-6 tensively in the delay embedding literature [46].In 7 this work we select a separate optimal τ k for each sub-8 ject and channel using the auto embedding method 9 described in [48].

21
We then tested the two ARX models {A k } performing as good as {A (2) k }.However, if the process is nonlinear, the latter would perform significantly better because of having both {A k } and {B k } trained on STIM ON data.
Bootstrap Test.For each subject and channel, we used a bootstrap test to validate if the presence of input lags significantly improved the prediction accuracy (i.e., whether a direct causal effect from input to that channel existed).From the original dataset D = {Y k , U, Y −k , ∆y k } for channel k (where Y −k corresponds to the iEEG information from all channels other than k) we created a synthetic one D = {Y k , Ũ , Y −k , ∆y k } by randomly shuffling the input history vector U (.) across time samples (i.e, Ũ (i) = U (j) for some random j).We then split this dataset into training and test sets, learn a dynamical model, and compute the corresponding test MSE as before.We repeated this process 100 times and compared the test MSE of the unshuffled model to the distribution of test MSEs from shuffled models to compute a sample-based p-value.We reject the null hypothesis of no input effect (i.e., conclude that a direct causal effect from input to that channel existed) if p < 0.05.
In addition to input effects, we also validated the causal effects of network interactions in each subjectchannel using a similar bootstrap test, wherein we generate synthetic datasets for each channel by randomly shuffling the iEEG past lags of other available channels.Specifically, a synthetic dataset D = {Y k , U, Ỹ−k , ∆y k } was created by randomly shuffling the network information history vector Y −k (.) across time samples (i.e, Ỹ−k (i) = Y −k (j) for some random j).The test for network effects follows the same experimental structure as described in the previous paragraph.
Learning models on subsets of stimulation frequencies.The results reported in Figure 9 were generated as follows.
We first split the dataset for each subject based on the stimulation frequency during each window and created multiple training sets corresponding to all subsets of {0 (no stim), 10, 25, 50, 100, 200}.For each subset of frequencies, we fit a switched ARX model and tested it on the complementary subset of frequencies omitted in the training set.The normalized MSEs for all the combinations are reported in Figure 9.

1 Introduction 1
Closed-loop neurostimulation has long been consid-2 ered a promising alternative to highly invasive re-3 section surgery for individuals with drug-resistant 4 epilepsy.Extensive research over the past decades 5 has highlighted its numerous advantages, including 6 short-term benefits like electric stimulation-induced 7 desynchronization of neuron firing to counteract8

12
of medial temporal lobe in patients with epilepsy.We use recordings from the Restoring Active Mem-3 ory (RAM) Public dataset, which offers a particu-4 larly rich resource for data-driven modeling due to the 5 presence of factorial parameter search experiments 6

Figure 1 :
Figure 1: (a) Illustration of the iEEG data used in this work for a sample subject (see Methods for details).Each cluster of adjacent blue circles corresponds to contacts on a single electrode (warped due to registration to standard space).u(t) encodes the applied stimulation pulse train, including both its frequency and amplitude.y k (t) refers to the recorded iEEG signal of the k th electrode ordered based on its closeness to the stimulation site.

5
Figure 2a shows how autoregressive models with

7 diction ( τ
Lms for a sparse model compared to Lms 8 in a dense model).In particular, this result discour-

12 extremely
limited in chronic implants) into learning 13 temporally short-range predictive relationships.

Figure 3 :
Figure3: Effect of sparser temporal sampling of autoregressive lags on model accuracy.For each channel in the data, we considered 2 categories of models: ARX models with dense temporal sampling (i.e., iEEG lags are spaced 1ms apart) and ARX models with sparse temporal sampling (i.e., iEEG lags are spaced τ > 1ms apart).For the latter, the exact sampling time τ was chosen separately for each channel based on delay embedding theory using the method described in[48].We trained sparse and dense models of various lags (L ∈ 1, 50, 100, 150, 200, 250, 300) and plotted the normalized MSE across subject-channels as a function of L. Although the sparse ARX model with L lags has access to a much larger duration of iEEG history (i.e., τ Lms) in comparison to a dense model (i.e., Lms), the former has lower accuracy.Thus, temporally short-range models with higher sampling rates are more predictive.

Figure 4 :Figure 5 :
Figure 4: Evidence of nonlinear dynamics and switching behavior.We tested an ARX model trained using STIM ON (Stim model) data against a similar model with the A k matrix replaced by that trained using STIM OFF data (Resting State (RS) model).(a) The performance (MSE) difference between these models and (b) their respective win rates, indicating the presence of stimulation-dependent nonlinearity.(c) When we account for the nonlinearity using a switched ARX model, we found it to be more descriptive of the iEEG response in comparison to a single ARX model

Figure 6 :
Figure 6: Comparison of switched-linear models with other forms of nonlinear models.(a) We compared the switched-linear model against an amplitude-weighted (AW) switched-linear model and a fully-bilinear model.As indicated by the win rate across, we found switched-linear models to be most predictive across all subject-channels (b) When we compared switched-linear to other forms of nonlinear models, we found that the artificial neural networks (ANNs) resulted in the lowest MSE and this result was followed by the switchedlinear model (* used when comparison p − value < 0.05).
highly predictive across channels, second only to feedforward Artificial Neural Networks (ANNs) with ReLU activation.Note, also, that ANNs with ReLU activation are also switched-linear models.In fact, when compared against ANNs with tanh nonlinearity, those with ReLU activations explained the data significantly better.Nevertheless, compared to a simple switched-linear model in Eq. (1), the ANNs with ReLU activation have two orders of magnitude more parameters, combinatorially more switching regions, and almost no biological interpretability.This makes the former, simpler switched-linear model potentially preferred for mechanistic understanding and modelbased closed-loop seizure stabilization, particularly in chronic implants with limited logic gates and battery

Figure 7 :perspective as well as a mechanistic understanding 6 of how stimulation effects spread throughout the 7 brain. Network interactions are captured by the term 8
Figure 7: Heightened activity in channels without direct causal effect from the stimulation.(a) Plot showing the STIM ON durations within a specific stretch of the recordings.In (b)-(d), we plot the iEEG activity in response to the stimulation signal in (a) for 3 channels that did not show any evidence of being directly affected by the applied stimulation.

1
and switched ARX models trained for stimulation 2 channels, and found that the latter model was favored 3 in 10 out of the 13 subjects in our data.Even for the 3 4 subjects where switched VARX was favored, we found that the relative improvement in MSE from incor-6 porating data from other channels was only modest 7 (< 10%).Interestingly, this is consistent with the role 8 played by the do-operator in causal discovery[52], ef-9 fectively isolating the anode from being causally in-10 fluenced by other nodes within the network.

Figure 8 :
Figure 8: Strength of network interactions as a function of distance from the stimulation site.The strength of causal effects from other nodes (channels) in the network to each node is measured by the MSE advantage M SE scalar − M SEvector, where the former (latter) comes from a model in which the history of other nodes' iEEG values is not (is) used for one-step-ahead prediction.(a)-(d)The MSE advantage for each channel as a function of its distance to the stimulation site (shown in blue) across 4 different subjects.We found that the distance-based moving average of this advantage (shown in red) is initially close to 0, rises with distance until about , and then decreases thereafter.Plots for the remaining 9 out of 13 subjects have been provided in Supplementary Figure4

Figure 9 :
Figure 9: Generalizability of trained models across stimulation frequencies and sessions.(a) We plotted the normalized test MSE as a function of the number of stimulation frequencies used as part of the training set.We found that including more frequencies results in a more generalizable model.(b) However, we did not find it to be beneficial when we extend the dataset using recordings from multiple sessions.Specifically, when we considered 2 sessions and the 3 possible models that can be trained from them (using only session 1 or only session 2 or both in the training set), we found that unseen data from session 1 was best predicted by the model trained on data from Session 1 only.(* used when comparison p − value < 0.05)

2
finding thus indicates the presence of non-stationary 3 or time-varying behavior across sessions.Therefore, 4 models trained over a prior session cannot be directly 5 deployed in a new session, but have to be re-trained 6 continuously to make sure they remain predictive.

8
In this study, we presented a data-centered frame-9 work for modeling the iEEG response induced by 10 intracranial neurostimulation.We highlighted the 11 effectiveness of switched-linear autoregressive mod-12 els with about 300ms length of history in accurately 13 representing seizure-free iEEG dynamics throughout 14 the pre-STIM to post-STIM duration.Furthermore, 15 we elaborated on efficient methodologies for training 16 such models without requiring supplementary infor-17 mation.We analyzed the models and channels from 18 the standpoint of causal effects of stimulation, both 19 direct and indirect.In addition to the modeling, 20 we experimentally characterized the generalizability 21 of the system identification process across different 22 stimulation frequencies and recording sessions.

23
Our focus on iEEG in this work stems from its im-24 portance and gold-standard role as a feedback signal 25 for closed-loop intracranial neurostimulation.In the 26 context of epilepsy, iEEG has long been used for the 27 localization of the , and subnetworks extracted purely 28 from iEEG data using unsupervised machine learning

Future 11 Data.
steps towards the development of a complete model-based closed-loop seizure control system require even richer and more extensive datasets that encompass simultaneous seizure and neurostimulation, enabling models to learn the interactions between ictal and stimulation-evoked dynamics.Such data, however, is challenging to collect for a number of reasons.Despite ongoing advancements in seizure detection, accurately and reliably timing stimulation relative to seizure onset remains a major challenge, 1 which is further exacerbated by the heterogeneity of 2 seizure events even within the same individual.Fur-3 ther, iEEG recordings have significantly better spa-4 tial and temporal coverage and algorithmic flexibility 5 during acute seizure monitoring, but manipulating 6 intrinsic seizure dynamics can interfere with medical 7 diagnostics at this stage.Animal and computational 8 models can therefore play an invaluable role before 9 algorithmic designs can be translated to humans.Throughout this work we use data from the 12 Parameter Search experiments of the Restoring Ac-13

1 tions.
This was due to the fact that only about 1/6 of 2 the recordings corresponded to durations where stim-3 ulation was applied.Hence, using the full recordings could (and did, based on our pilot analyses) bias the 5 models towards fitting STIM OFF durations and ig-6 noring evoked responses.7Wedefine the input time-series u(.) based on the 8 stimulation timing and parameters.If a pulse train 9 with frequency f and amplitude a was applied start-10 ing from time t k (always lasting for 500ms), then we 11 set u(t k +∆t) = a for all ∆t ∈ [0, 500] that are integer 12 multiples of 1000/f and u(t) = 0 otherwise.13 Unless otherwise stated, we split the processed 14 iEEG recordings into training and test as follows.15 Each 1500ms window (corresponding to each instance 16 of stimulation, see above) was split into 5 continuous 17 folds of data (300ms each), out of which 4 randomly 18 selected folds were used for training, and the remain-19 ing fold was used for testing.20 Models.The main focus of our work is the dynamic 21 model described in Eq. (1), which is used to make 22 one-step ahead predictions about the iEEG response 23 given past iEEG and input history lags as regression 24 features.We denote the iEEG history of length L 25 for channel k at time t as the vector Y L k (t) = [y k (t − 26 1), y k (t − 2), ..., y k (t − L)] ⊺ .Similarly, the input lags 27 vector U M (t) of length M is given by [u(t − 1), u(t − 28 2), ..., u(t − M )] ⊺ .29 Several families of models used in this work are 30 special cases of the model in Eq. (1).A linear au-31 toregressive model is obtained by setting the matrices 32 B k , C K , and D k to 0. The learnable (not necessarily 33 zero) parameter sets of the subclasses of Eq. (1) used 34

10
Indirect test of linearity.The results demon-11 strated in Figure4were obtained as follows.For 12 each channel, we divided the iEEG dataset into 3 13 sub-datasets, with the first one including only STIM 14 OFF data while the other 2 contained only STIM ON 15 durations.Using the first dataset, we trained an AR 16 model with parameter matrices {A (1) k }.The second 17 dataset was used to learn an ARX model with paramtrained on dataset 2 by fixing autore-20 gressive weights to {A

( 1 )
k } and only learning {B k } on the third dataset.If the underlying 23 process is indeed linear, we expect the autoregres-24 sive weights ({A k }) to remain the same during STIM 25 ON and STIM OFF durations, which would result in the ARX model {A

9
30in that same channel, the past M samples U M (t) 31 of stimulation input, the past P samples Y P i (t) of 32 iEEG in other channels, and the bilinear interactions 33 U(t)Y L k (t) between the stimulation input and iEEG 34 output (cf.Methods for details).
) 25 to model individualized (subject and channel-26 specific) iEEG response to stimulation.In this model, 27 the one-step change in the iEEG activity of the k th 28 channel, i.e., ∆y k (t) = y k (t + 1) − y k (t), is predicted 29 linearly based on the past L samples Y L k (t) of iEEG